import pyreadr
import sklearn
import pandas as pd
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
import shap
import numpy as np
shap.initjs()
from lime.lime_tabular import LimeTabularExplainer
train = pyreadr.read_r('hmc_train.Rda')['train']
valid = pyreadr.read_r('hmc_valid.Rda')['valid']
assert set(train.columns) == set(valid.columns)
column_names = sorted(train.columns)
train = train[column_names]
valid = valid[column_names]
column_names.remove("PURCHASE")
def xy_split(data, y_name="PURCHASE"):
    return data.drop([y_name], axis=1).to_numpy(), data[y_name].to_numpy()
X_train, Y_train = xy_split(train)
X_valid, Y_valid = xy_split(valid)
xgmodel = XGBClassifier(max_depth=13,
objective='binary:logistic',
gamma=0.1)
xgmodel.fit(X_train, Y_train, verbose=True)
xg_predictions = xgmodel.predict(X_train)
valid_score = xgmodel.score(X_valid, Y_valid)
print("xgboost valid score {}".format(valid_score))
logmodel = LogisticRegression(max_iter=1000, solver='lbfgs', C=3)
logmodel.fit(X_train, Y_train)
log_predictions = logmodel.predict(X_train)
log_score = logmodel.score(X_valid, Y_valid)
print("logistic valid score {}".format(log_score))
train.info()
print("Dataset bias: {}".format(np.average(Y_train)))
This is an uplift modelling dataset (from the R package Information), with the indicator variable "TREATMENT" included as a column (prepared for the single-model approach). The data is highly imbalanced, so the scores above are not as good as they look (the logistic regression scores close to a classifier that always answers "no"). Another issue is that the column names are not self-explanatory, so I will focus on the columns whose names are clear to understand.
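Given that imbalance, it helps to compare the validation scores above with the accuracy of a constant majority-class predictor. A minimal sketch on synthetic labels (the 80/20 split below is illustrative, not the real PURCHASE distribution):

```python
import numpy as np

# Synthetic, highly imbalanced labels: roughly 20% positives (illustrative only).
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.2).astype(int)

# A classifier that always answers the majority class ("no") scores:
majority_acc = max(np.mean(y), 1 - np.mean(y))
print("positive rate: {:.3f}".format(np.mean(y)))
print("always-'no' baseline accuracy: {:.3f}".format(majority_acc))
```

Any model whose accuracy does not clearly beat this baseline may simply be echoing the majority class.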
explainer = LimeTabularExplainer(X_train, feature_names=column_names,
class_names=['NOT', 'BUY'], discretize_continuous=True)
def lime_explainer(X, Y, model, features=3):
    model_explainer = explainer.explain_instance(X, model.predict_proba, num_features=features)
    model_explainer.show_in_notebook(show_table=True, show_all=False)
    proba = model.predict_proba(X.reshape(1, -1))[0][1]
    print("Purchase probability {}, true label: {}\n".format(proba, "PURCHASE" if Y else "NO PURCHASE"))
    explanations = model_explainer.as_list()
    positive = sorted([(round(imp, 2), text) for text, imp in explanations if imp > 0], reverse=True)
    negative = sorted([(-round(imp, 2), text) for text, imp in explanations if imp < 0], reverse=True)
    positive = pd.DataFrame.from_records(positive, columns=['Importance', 'Explanation'])
    negative = pd.DataFrame.from_records(negative, columns=['Importance', 'Explanation'])
    print("Parameters in favor of PURCHASE:")
    display(positive)
    print("Parameters in favor of NO PURCHASE:")
    display(negative)
index = 11
lime_explainer(X_valid[index], Y_valid[index], xgmodel, features=5)
This is one of the rare cases where the model predicts PURCHASE correctly, given the structure of the dataset. The explanation, however, makes it look like an easy task. Only one parameter drives the prediction towards the wrong answer: "monthly mortgage payment" with a value of 414, probably meaning that a client already loaded with payments is less willing to take on another one. The other parameters tip the balance towards PURCHASE and suggest that the client is wealthy (more than 7 open revolving accounts and no limit on a premium credit card), so they can afford more expenses. Also, region A is probably poor, so clients outside of it are more likely to buy.
lime_explainer(X_valid[index], Y_valid[index], logmodel, features=5)
Surprisingly, the linear model comes close to predicting the correct answer. The same parameter, the high mortgage payment, suggests "NO PURCHASE", but it is counterweighted, in a more linear fashion, by parameters linked to the client's wealth (high credit card limits).
index = 2
lime_explainer(X_valid[index], Y_valid[index], xgmodel, features=5)
In this case the gradient boosting model is sure that this is a clear "not buying" case. Its non-linearity is visible here: having no mortgage payment (indicated by a value of zero) slightly suggests buying, instead of not buying as in the previous case. But this is outweighed by having no open revolving accounts (and, in this case, a revolving limit of zero). We can also see that 5 parameters are not enough to describe how this model works.
lime_explainer(X_valid[index], Y_valid[index], logmodel, features=5)
As we can see, the linear model cannot capture the non-linear relation with the monthly mortgage payment, so here too it drives the prediction towards "NO PURCHASE" (the opposite of the previous model). The answer is still "NO", however, because the classifier is effectively fitted to always answer "no": it could not extract much signal from the data.
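One way to check the "always no" suspicion is to look at the distribution of the model's own predictions. A minimal sketch on synthetic data (noise features and a 5% positive rate stand in for the real training set; the behaviour, not the numbers, is the point):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: uninformative features with ~5% positive labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(2_000, 10))
y = (rng.random(2_000) < 0.05).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
preds = clf.predict(X)

# With nothing to learn from, the model collapses to the majority class:
# every predicted probability sits near the base rate, well below 0.5.
print("predicted classes:", np.unique(preds))
print("fraction predicted PURCHASE:", preds.mean())
```

If `preds` contains (almost) no positives, the accuracy score is just the majority-class baseline in disguise.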
Clients who we predicted would buy, but did not:
np.where(np.logical_and((xgmodel.predict(X_valid) != Y_valid), Y_valid == 0))
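The same index arithmetic generalizes to a small error breakdown. A sketch with made-up `y_true`/`y_pred` arrays standing in for `Y_valid` and `xgmodel.predict(X_valid)`:

```python
import numpy as np

# Illustrative stand-ins for Y_valid and xgmodel.predict(X_valid).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1])

# False positives: predicted PURCHASE, but the client did not buy.
fp_idx = np.where((y_pred == 1) & (y_true == 0))[0]
# False negatives: predicted NO PURCHASE, but the client did buy.
fn_idx = np.where((y_pred == 0) & (y_true == 1))[0]

print("false positive indices:", fp_idx)  # [0 6]
print("false negative indices:", fn_idx)  # [4]
```

The cell above selects false positives (`prediction != Y_valid` with `Y_valid == 0`); swapping the condition to `Y_valid == 1` would give the false negatives instead.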
index = 9
lime_explainer(X_valid[index], Y_valid[index], xgmodel, features=5)
This is a rare false-positive prediction by the gradient boosting model (considering that negative answers make up about 80% of the dataset). As before, a small number of open revolving accounts, together with disputed accounts, is treated as a predisposition not to purchase (though it is interesting why its contribution is positive here). A bigger impact, however, comes from having no student account and not living in a poor region, together with no premium credit card (which is probably conflated with having no limit). In conclusion, this prediction does not seem reasonable, because the listed odds for buying are quite common, unlike buying itself.
lime_explainer(X_valid[index], Y_valid[index], logmodel, features=5)
The linear model, of course, does not make this false positive, because it almost always answers "no". Interestingly, the parameters in favor of buying here differ from those in the gradient boosting model, and given how the linear model works, it is still not sure about the answer.